Scheduling Load Operations on VLIW Machines
Abstract
The gap between processor speeds and memory access times continues to widen, and it appears in systems ranging from embedded computing systems to high-performance supercomputing systems. In this paper, we present a simple instruction scheduling algorithm targeted at VLIW architectures commonly found in embedded systems and high-performance workstations (e.g., Itanium). The algorithm does not require substantial hardware support; it schedules load operations so as to mask the latency of delinquent loads, which are associated with high miss rates and very long average latencies. Our algorithm is named Cache Sensitive Scheduling (CSS). CSS is designed to be sensitive to the varying memory latencies of load operations and compensates for them within the instruction schedule by overlapping the typically long load latencies with useful operations, which reduces stall penalties. CSS can extend a rank-function-based scheduler with two additional components that incorporate the profiled average latency of an operation and the latencies of its predecessors. Our results show that these additional components are effective in generating schedules that are more sensitive to the latencies of load instructions. To support the selection and relative weighting of our rank function components, we use multivariate statistical analysis to determine the degree of correlation between the rank components and the execution time of the program. In our experiments with a parameterized VLIW compiler-simulator infrastructure across a variety of memory hierarchy configurations, we were able to achieve 20% speedups and 44% stall cycle reductions over a more conventional critical path scheduling algorithm.
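The abstract does not give the CSS rank function itself, so the following is only a minimal sketch of the general idea: a list scheduler whose rank adds two latency terms (an operation's own profiled average latency and the summed latencies of its predecessors) to a conventional critical-path term. The linear combination, the weights `w_path`, `w_lat`, and `w_pred`, the `Op` type, and the example latencies are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    latency: int                                 # profiled average latency in cycles (made-up values)
    preds: list = field(default_factory=list)    # names of operations this one depends on


def critical_path(ops):
    """Longest latency-weighted path from each operation to the end of the DAG."""
    succs = {name: [] for name in ops}
    for op in ops.values():
        for p in op.preds:
            succs[p].append(op.name)
    memo = {}

    def height(name):
        if name not in memo:
            tail = max((height(s) for s in succs[name]), default=0)
            memo[name] = ops[name].latency + tail
        return memo[name]

    return {name: height(name) for name in ops}


def rank(ops, w_path=1.0, w_lat=1.0, w_pred=1.0):
    """Hypothetical rank: critical path + own latency + predecessor latencies."""
    cp = critical_path(ops)
    return {
        name: w_path * cp[name]
        + w_lat * op.latency
        + w_pred * sum(ops[p].latency for p in op.preds)
        for name, op in ops.items()
    }


def list_schedule(ops, width, ranks):
    """Greedy list scheduling: issue up to `width` ready operations per cycle.

    An operation is ready once every predecessor has been issued and its
    latency has elapsed. Assumes the dependence graph is acyclic.
    """
    issue = {}
    cycle = 0
    while len(issue) < len(ops):
        ready = [
            n for n, op in ops.items()
            if n not in issue
            and all(p in issue and issue[p] + ops[p].latency <= cycle for p in op.preds)
        ]
        ready.sort(key=lambda n: (-ranks[n], n))  # highest rank first, name breaks ties
        for n in ready[:width]:
            issue[n] = cycle
        cycle += 1
    return issue


# A long-latency load with one consumer, plus independent work that can hide the latency.
example_ops = {
    "ld_a": Op("ld_a", 10),
    "add1": Op("add1", 1),
    "add2": Op("add2", 1, ["add1"]),
    "use_a": Op("use_a", 1, ["ld_a"]),
}
example_ranks = rank(example_ops)
schedule = list_schedule(example_ops, width=1, ranks=example_ranks)
print(schedule)
```

In this toy example the long-latency load is issued first and the independent additions fill the cycles before its consumer becomes ready. Setting `w_lat` and `w_pred` to zero reduces the rank to a plain critical-path priority, which gives a baseline to compare against.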
Similar resources
Aligned Scheduling: Cache-Efficient Instruction Scheduling for VLIW Processors
The performance of statically scheduled VLIW processors is highly sensitive to the instruction scheduling performed by the compiler. In this work we identify a major deficiency in existing instruction scheduling for VLIW processors. Unlike most dynamically scheduled processors, a VLIW processor with no load-use hardware interlocks will completely stall upon a cache-miss of any of the operations...
Cache Sensitive Instruction Scheduling
The processor speeds continue to improve at a faster rate than the memory access times. The issue of data locality is still unsolved, and continues to be a problem given the widening gap between processor speeds and memory access times. Compiler research has chosen to address this problem in many directions including source code transformations of loops, static data reorganization, dynamic data...
Optimality of the flexible job shop scheduling system based on Gravitational Search Algorithm
The Flexible Job Shop Scheduling Problem (FJSP) is one of the most general and difficult of all traditional scheduling problems. The Flexible Job Shop Problem (FJSP) is an extension of the classical job shop scheduling problem which allows an operation to be processed by any machine from a given set. The problem is to assign each operation to a machine and to order the operations on the machine...
Exploring Energy-Performance Trade-Offs for Heterogeneous Interconnect Clustered VLIW Processors
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making design simpler, it introduces extra overheads by way of inter-cluster communication. This communication ...
Publication date: 2004